Automated data annotation for computer vision applications

ABSTRACT

This disclosure provides methods, devices, and systems for training machine learning models. The present implementations more specifically relate to techniques for automating the annotation of data for training machine learning models. In some aspects, a machine learning system may receive a reference image depicting an object of interest with one or more annotations and also may receive one or more input images depicting the object of interest at various distances, angles, or locations but without annotations. The machine learning system maps a set of points in the reference image to a respective set of points in each input image so that the annotations from the reference image are projected onto the object of interest in each input image. The machine learning system may further train a machine learning model to produce inferences about the object of interest based on the annotated input images.

TECHNICAL FIELD

The present implementations relate generally to machine learning, and specifically to automated data annotation for machine learning.

BACKGROUND OF RELATED ART

Computer vision is a field of artificial intelligence (AI) that uses machine learning to draw inferences about an environment from images of the environment. Machine learning is a technique for improving the ability of a computer system or application to perform a certain task. Machine learning generally includes a training phase and an inferencing phase. During the training phase, a machine learning system (such as a neural network) is provided with one or more “answers” and a large volume of raw training data associated with the answers. The machine learning system analyzes the training data to learn a set of rules (also referred to as a “model”) that can be used to describe each of the one or more answers. During the inferencing phase, a computer vision application may infer answers from new data using the learned set of rules. Example computer vision applications include object detection and object tracking, among other examples.

Data annotation is the process of tagging or labeling training data to provide context for the training operation. For example, when training a machine learning model to identify a particular object (or class of objects) in images, the machine learning system may be provided a large volume of input images depicting the object. Each of the input images may be annotated to ensure that the machine learning system can learn a set of features that uniquely describes the target object to the exclusion of any other objects that may be depicted in the input images. Example suitable annotations may include, among other examples, a bounding box surrounding the target object in each of the input images and contextual information labeling the target object within each bounding box.

Existing data annotation techniques rely on human operators to review and annotate each input image (or other training data) to be used for training. Due to the large volume of input images required for training, human operators may require hundreds of hours (if not longer) to construct an annotated set of input images. Thus, there is a need to more efficiently annotate training data.

SUMMARY

This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

One innovative aspect of the subject matter of this disclosure can be implemented in a method of training a machine learning model. The method includes steps of receiving a first input image depicting an object of interest; receiving a reference image depicting the object of interest and one or more annotations associated with the object of interest; mapping a plurality of first points in the reference image to a respective plurality of second points in the first input image so that the one or more annotations in the reference image are projected onto the object of interest in the first input image; and training the machine learning model to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.

Another innovative aspect of the subject matter of this disclosure can be implemented in a machine learning system including a processing system and a memory. The memory stores instructions that, when executed by the processing system, cause the machine learning system to receive a first input image depicting an object of interest; receive a reference image depicting the object of interest and one or more annotations associated with the object of interest; map a plurality of first points in the reference image to a respective plurality of second points in the first input image so that the one or more annotations in the reference image are projected onto the object of interest in the first input image; and train the machine learning model to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.

BRIEF DESCRIPTION OF THE DRAWINGS

The present embodiments are illustrated by way of example and are not intended to be limited by the figures of the accompanying drawings.

FIG. 1 shows a block diagram of an example computer vision system, according to some implementations.

FIG. 2 shows a block diagram of a machine learning system, according to some implementations.

FIG. 3A shows an example mapping that relates a set of points on an annotated reference image to a respective set of points on an input image.

FIG. 3B shows an example of an annotated input image that can be produced as a result of the mapping depicted in FIG. 3A.

FIG. 4 shows a block diagram of an image annotation system, according to some implementations.

FIG. 5 shows an example mapping of reference image data to input image data based on a homography.

FIG. 6 shows another example mapping of reference image data to input image data based on a homography.

FIG. 7 shows an example mapping of reference annotation data to input annotation data based on a homography.

FIG. 8 shows an example machine learning system, according to some implementations.

FIG. 9 shows an illustrative flowchart depicting an example operation for training a machine learning model, according to some implementations.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth such as examples of specific components, circuits, and processes to provide a thorough understanding of the present disclosure. The term “coupled” as used herein means connected directly to or connected through one or more intervening components or circuits. The terms “electronic system” and “electronic device” may be used interchangeably to refer to any system capable of electronically processing information. Also, in the following description and for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the aspects of the disclosure. However, it will be apparent to one skilled in the art that these specific details may not be required to practice the example embodiments. In other instances, well-known circuits and devices are shown in block diagram form to avoid obscuring the present disclosure. Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing and other symbolic representations of operations on data bits within a computer memory.

These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present disclosure, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.

Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present application, discussions utilizing the terms such as “accessing,” “receiving,” “sending,” “using,” “selecting,” “determining,” “normalizing,” “multiplying,” “averaging,” “monitoring,” “comparing,” “applying,” “updating,” “measuring,” “deriving” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

In the figures, a single block may be described as performing a function or functions; however, in actual practice, the function or functions performed by that block may be performed in a single component or across multiple components, and/or may be performed using hardware, using software, or using a combination of hardware and software. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described below generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. Also, the example input devices may include components other than those shown, including well-known components such as a processor, memory and the like.

The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof, unless specifically described as being implemented in a specific manner. Any features described as modules or components may also be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a non-transitory processor-readable storage medium including instructions that, when executed, perform one or more of the methods described above. The non-transitory processor-readable data storage medium may form part of a computer program product, which may include packaging materials.

The non-transitory processor-readable storage medium may comprise random access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read only memory (ROM), non-volatile random-access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, other known storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a processor-readable communication medium that carries or communicates code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer or other processor.

The various illustrative logical blocks, modules, circuits and instructions described in connection with the embodiments disclosed herein may be executed by one or more processors (or a processing system). The term “processor,” as used herein may refer to any general-purpose processor, special-purpose processor, conventional processor, controller, microcontroller, and/or state machine capable of executing scripts or instructions of one or more software programs stored in memory.

Various aspects relate generally to machine learning, and more specifically, to techniques for automating the annotation of data for training machine learning models. In some aspects, a machine learning system may receive a reference image depicting an object of interest with one or more annotations and also may receive one or more input images depicting the object of interest at various distances, angles, or locations but without annotations. The machine learning system maps a set of points in the reference image to a respective set of points in each input image so that the annotations from the reference image are projected onto the object of interest in each input image. In some implementations, the mapping may be based on a homography. For example, the machine learning system may calculate a respective homography for each input image that transforms the points in the reference image to the respective points in the input image. Each homography may be applied to the annotations in the reference image to annotate a respective input image. The machine learning system may further train a machine learning model to produce inferences about the object of interest based on the annotated input images.

Particular implementations of the subject matter described in this disclosure can be implemented to realize one or more of the following potential advantages. By mapping various points in a reference image to respective points in an input image, aspects of the present disclosure may substantially automate the process of data annotation. For example, a human operator can annotate a single reference image and utilize the machine learning system to programmatically transfer the annotations from the reference image to a large volume of input images. In contrast with existing data annotation techniques, which require a human operator to manually annotate each input image, the data annotation techniques of the present disclosure can substantially reduce the time and cost associated with training machine learning models, or may allow more input data to be annotated and processed for training during the same interval of time.

FIG. 1 shows a block diagram of an example computer vision system 100, according to some implementations. In some aspects, the computer vision system 100 may be configured to generate inferences about one or more objects of interest (also referred to as “target objects”). In the example of FIG. 1, an object of interest 101 may be any device capable of displaying a dynamic sequence of digits or numbers on a substantially flat or planar surface. Example suitable objects of interest include water meters, electrical meters, or various other digital or analog metering devices, among other examples. In some other implementations, the computer vision system 100 may be configured to generate inferences about various other objects of interest in addition to, or in lieu of, the object of interest 101.

The system 100 includes an image capture component 110 and an image analysis component 120. The image capture component 110 may be any sensor or device (such as a camera) configured to capture a pattern of light in its field-of-view (FOV) 112 and convert the pattern of light to a digital image 102. For example, the digital image 102 may include an array of pixels (or pixel values) representing the pattern of light in the FOV 112 of the image capture component 110. As shown in FIG. 1, the object of interest 101 is located within the FOV 112 of the image capture component 110. As a result, the digital image 102 may include the object of interest 101.

The image analysis component 120 is configured to produce one or more inferences 103 based on the digital image 102. In some aspects, the image analysis component 120 may generate inferences about the object of interest 101 depicted in the image 102. For example, the image analysis component 120 may detect the object of interest 101 in the digital image 102 and infer the numbers displayed thereon. In other words, the image analysis component 120 may output a numerical value (such as “012345”), as an inference 103, representing an interpretation or reading of the digits displayed by the object of interest 101.

In some implementations, the image analysis component 120 may generate the inference 103 based on a machine learning (ML) model 122. Machine learning is a technique for improving the ability of a computer system or application to perform a certain task. During a training phase, a machine learning system may be provided with multiple “answers” and one or more sets of raw data to be mapped to each answer. For example, a machine learning system may be trained to read the digits displayed by the object of interest 101 by providing the machine learning system with a large number of images depicting the object of interest 101 (which represents the raw data) and contextual information indicating the actual values of the digits displayed by the object of interest 101 in each image (which represents the answers).

The machine learning system analyzes the raw data to “learn” a set of rules that can be used to read or interpret the digits displayed by the same (or similar) object of interest 101 in other images. For example, the machine learning system may perform statistical analysis on the raw data to determine a common set of features (also referred to as “rules”) that can be associated with each number or digit that can be displayed by the object of interest 101. Deep learning is a particular form of machine learning in which the model being trained is a multi-layer neural network. Deep learning architectures are often referred to as artificial neural networks due to the way in which information is processed (similar to a biological nervous system).

For example, each layer of the deep learning architecture is formed by a number of artificial neurons. The neurons are interconnected across the various layers so that input data (or the raw data) may be passed from one layer to another. More specifically, each layer of neurons may perform a different type of transformation on the input data that will ultimately result in a desired output (such as a numerical prediction). The interconnected framework of neurons may be referred to as a neural network model. In some implementations, the ML model 122 may be a neural network model. As such, the ML model 122 may include a set of rules that can be used to infer the values of each digit displayed by the object of interest 101.

In some aspects, data annotations may help guide the machine learning system to train a robust and accurate ML model 122. Data annotation is the process of tagging or labeling training data to provide context for the training operation. For example, each of the input images may be annotated to ensure that the machine learning system can learn a set of features that uniquely describes the object of interest 101 to the exclusion of any other objects or features that may be included in the input images. Example suitable annotations may include bounding boxes surrounding the digits displayed by the object of interest 101 and contextual information labeling or otherwise identifying the digit(s) bound by each bounding box.
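
For illustration only, one possible software representation of such annotations is sketched below in Python. The schema (field names such as mask, boxes, and label) is hypothetical and is not prescribed by this disclosure.

```python
# A minimal sketch of one possible annotation record for a single image.
# The field names here are hypothetical; the disclosure does not define a schema.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Image mask delineating the display region, as four (x, y) corners.
    mask: list[tuple[float, float]]
    # One bounding box per digit position, each as (x_min, y_min, x_max, y_max).
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)
    # Ground-truth reading of the displayed digits (the contextual information).
    label: str = ""

reference_annotation = Annotation(
    mask=[(120, 80), (360, 80), (360, 140), (120, 140)],
    boxes=[(120 + 40 * i, 80, 160 + 40 * i, 140) for i in range(6)],
    label="012345",
)
```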

Existing data annotation techniques rely on human operators to review and annotate each input image in a training set provided to a machine learning system. However, as described above, the machine learning system may require a large volume of input images to train a robust and accurate ML model. Moreover, each input image in a training set may depict the object of interest 101 at a different distance, angle, or location, or under different lighting conditions. As a result, human operators may require hundreds of hours (if not longer) to annotate each input image in a given training set.

In some aspects, a machine learning system (or image annotation system) may annotate a large volume of input images with little or no involvement by a human operator. More specifically, the machine learning system may copy or transfer a set of annotations from an annotated reference image to each input image in a training set. In some implementations, the machine learning system may transfer the annotations from the reference image to each input image based on a mapping (such as a homography) between various points on the reference image and respective points on each input image. As a result, a human operator can manually annotate a single reference image and utilize the machine learning system to annotate the remaining input images in a training set.

FIG. 2 shows a block diagram of a machine learning system 200, according to some implementations. In some aspects, the machine learning system 200 may be configured to produce a neural network model 207 based, at least in part, on one or more annotated reference images 201 and a set of input images 202. An annotated reference image 201 depicts an object of interest with one or more annotations. The input images 202 may depict the object of interest at various distances, angles, or locations, or under various lighting conditions, but without the annotations included in the annotated reference image 201. In some aspects, the object of interest may be configured to display a sequence of digits (such as the object of interest 101 of FIG. 1). In some implementations, the neural network model 207 may be one example of the ML model 122 of FIG. 1. Thus, the neural network model 207 may include a set of rules that can be used to infer a value of each digit displayed by the object of interest.

The machine learning system 200 includes an image annotator 210, a neural network 220, and a loss calculator 230. In some aspects, the image annotator 210 may annotate each input image 202, as a respective annotated input image 203, based on the annotated reference image 201. Aspects of the present disclosure recognize that, because the same (or similar) object of interest is depicted in each of the reference image 201 and the input images 202, at least a portion of the reference image 201 and a respective portion of an input image 202 may depict the same set of features on the object of interest. Thus, the image annotator 210 may map a set of points on the reference image 201 to a respective set of points on an input image 202, where the mapped points coincide with the same feature(s) on the object of interest. In some implementations, the image annotator 210 may copy or transfer the annotations from the reference image 201 to a respective annotated input image 203 as a result of the mapping.
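
The disclosure does not mandate a particular technique for finding such corresponding feature points. As a non-limiting sketch, one common approach pairs keypoints detected in both images, for example with ORB descriptors and brute-force matching in OpenCV (assumed available):

```python
# A sketch of one way to obtain candidate point correspondences between
# the reference image and an input image. The disclosure does not name a
# feature detector; ORB with brute-force Hamming matching is one common choice.
import cv2

def find_correspondences(reference_gray, input_gray, max_matches=200):
    orb = cv2.ORB_create(nfeatures=2000)
    kp_ref, des_ref = orb.detectAndCompute(reference_gray, None)
    kp_in, des_in = orb.detectAndCompute(input_gray, None)

    # Cross-checked brute-force matching on binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_ref, des_in), key=lambda m: m.distance)

    # Keep the strongest matches as (reference point, input point) pairs.
    src = [kp_ref[m.queryIdx].pt for m in matches[:max_matches]]
    dst = [kp_in[m.trainIdx].pt for m in matches[:max_matches]]
    return src, dst
```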

FIG. 3A shows an example mapping 300 that relates a set of points on an annotated reference image 310 to a respective set of points on an input image 320. In some implementations, the mapping 300 may be performed by the image annotator 210 of FIG. 2. Thus, the annotated reference image 310 may be one example of the annotated reference image 201 and the input image 320 may be one example of any of the input images 202.

As shown in FIG. 3A, the annotated reference image 310 depicts an object of interest 301 with a number of annotations 302. In some aspects, the object of interest 301 may be any device configured to display a sequence of digital or analog digits (such as a meter). The digits can be displayed within a display region (depicted as a black rectangle) at the center of the object of interest 301. In the example of FIG. 3A, the annotations 302 are shown to include a number of bounding boxes (depicted as six rectangular boxes each coinciding with a respective digit that can be displayed within the display region) and an image mask (depicted as a rectangular box that circumscribes the display region). By contrast, the input image 320 depicts the object of interest 301 with a sequence of digits (“123456”) displayed in the display region, but without annotations. Further, the object of interest 301 is depicted at a different angle in the input image 320 than in the annotated reference image 310.

In some implementations, the mapping 300 may correlate various points 330 on the annotated reference image 310 (which includes image points 331-334) with respective points 340 on the input image 320 (which includes image points 341-344). In some implementations, the image points 330 and 340 may be associated with various features of the object of interest 301 (also referred to as “feature points”). For example, the top-left corner of the display region may be mapped to points 331 and 341 on the images 310 and 320, respectively; the top-right corner of the display region may be mapped to points 332 and 342 on the images 310 and 320, respectively; the bottom-left corner of the display region may be mapped to points 333 and 343 on the images 310 and 320, respectively; and the bottom-right corner of the display region may be mapped to points 334 and 344 on the images 310 and 320, respectively. As such, the mapping 300 may transform any image point in the first set 330 to a respective image point in the second set 340.

In some aspects, the mapping 300 may project the annotations 302 from the annotated reference image 310 onto the input image 320. For example, the annotations 302 may coincide with (or overlap) a subset of the image points 330 in the annotated reference image 310. As such, the mapping 300 may transform the image points 330 associated with the annotations 302 into respective image points 340 in the input image 320. In some aspects, the image annotator 210 may use the mapping 300 to annotate the input image 320. For example, the image annotator 210 may reproduce the annotations 302 on the subset of image points 340 (in the input image 320) associated therewith. FIG. 3B shows an example of an annotated input image 330 that can be produced as a result of the mapping 300 depicted in FIG. 3A. As shown in FIG. 3B, the annotated input image 330 depicts the object of interest 301 with the annotations 302. In some implementations, the annotated input image 330 may be one example of any of the annotated input images 203 of FIG. 2.

Referring back to FIG. 2, the neural network 220 receives each of the annotated input images 203 and attempts to read the numbers displayed thereon. For example, the object of interest may display a different sequence of numbers in each of the input images 202 (and thus, in each of the annotated input images 203). The actual numbers displayed in each of the input images 202 may be provided as contextual information 205 to the machine learning system 200. In some implementations, the neural network 220 may analyze the contents of the bounding boxes in each of the annotated input images 203 to predict the numbers displayed therein. For example, the neural network 220 may form a network of connections across multiple layers of artificial neurons that begin with an annotated input image 203 and lead to a prediction 204. The connections are weighted to result in a prediction 204 that reflects the numbers displayed by the object of interest in the annotated input image 203.

The loss calculator 230 compares the prediction 204 with the contextual information 205 to determine an amount of loss (or error) between the predicted values of the digits displayed by the object of interest and their actual values. The loss calculator 230 may further update a set of weights 206 associated with the connections in the neural network 220 based on the amount of loss. In some implementations, the training operation may be performed over multiple iterations. During each iteration, the neural network 220 produces a respective prediction 204 based on the weighted connections across the layers of artificial neurons, and the loss calculator 230 updates the weights 206 associated with the connections based on the amount of loss between the prediction 204 and the contextual information 205. The neural network 220 may output the weighted connections, as the neural network model 207, when certain convergence criteria are met (such as when the loss falls below a threshold level or after a predetermined number of iterations of the training operation have been completed).
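
A minimal sketch of this iterative predict-compare-update cycle, written with PyTorch (an assumption; the disclosure does not specify a framework), is shown below. The network architecture and data loader are placeholders rather than the disclosed model:

```python
# A sketch of the training loop described above: predict, compare with the
# contextual information (ground-truth digits), update the weights, and stop
# on a loss threshold or an iteration cap. Not the disclosed implementation.
import torch
import torch.nn as nn

def train(model, loader, epochs=50, loss_threshold=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()  # per-digit classification loss

    for _ in range(epochs):
        total = 0.0
        for crops, digit_labels in loader:
            # crops: bounding-box regions from an annotated input image.
            # digit_labels: actual digit values (the contextual information 205).
            logits = model(crops)                    # prediction (cf. 204)
            loss = criterion(logits, digit_labels)   # loss vs. context (cf. 205)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                         # update weights (cf. 206)
            total += loss.item()
        if total / max(len(loader), 1) < loss_threshold:  # convergence criterion
            break
    return model
```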

FIG. 4 shows a block diagram of an image annotation system 400, according to some implementations. In some implementations, the image annotation system 400 may be one example of the image annotator 210 of FIG. 2. Thus, the image annotation system 400 may be configured to annotate a set of input images (represented by input image data 422) based on an annotated reference image 410. In some implementations, the annotated reference image 410 may be one example of the annotated reference image 201 of FIG. 2. More specifically, the annotated reference image 410 includes reference image data 412 and reference annotation data 414. The reference image data 412 represents an image depicting an object of interest as captured or acquired by an image capture device (such as raw image data). By contrast, the reference annotation data 414 includes one or more annotations associated with the object of interest depicted by the reference image data 412.

The image annotation system 400 includes a mapping calculator 402 and an annotation mapper 404. The mapping calculator 402 is configured to determine a mapping 405 that relates at least a portion of the reference image data 412 with a respective portion of the input image data 422. In some implementations, the input image data 422 may represent an input image such as any of the input images 202 of FIG. 2. Thus, the input image data 422 also depicts the same (or similar) object of interest as the reference image data 412. Aspects of the present disclosure recognize that some objects of interest (such as meters) tend to have substantially flat or planar surfaces. Thus, in some implementations, the mapping 405 may be a homography.

A homography is a 3×3 matrix (H) that maps any point (x₁, y₁, z₁) on a first plane (P1) to a respective point (x₂, y₂, z₂) on a second plane (P2), where:

$\begin{bmatrix} x_{2} \\ y_{2} \\ z_{2} \end{bmatrix} = H \begin{bmatrix} x_{1} \\ y_{1} \\ z_{1} \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_{1} \\ y_{1} \\ z_{1} \end{bmatrix}$
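
As a worked example of the equation above, the points are expressed in homogeneous coordinates, so the mapped pixel coordinates are recovered by dividing through by z₂. A short sketch using NumPy (an assumption; any linear algebra library would do):

```python
# Transform a single reference-image point into the input image with a
# homography H: multiply in homogeneous coordinates, then normalize by z2.
import numpy as np

def transform_point(H, x1, y1, z1=1.0):
    x2, y2, z2 = H @ np.array([x1, y1, z1])
    return x2 / z2, y2 / z2  # pixel coordinates on the second plane
```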

As shown in the equation above, a homography H has 8 degrees of freedom. Thus, at least 4 corresponding points are needed on each of the planes P1 and P2 to calculate a homography H. In some implementations, the mapping calculator 402 may calculate a homography H that relates a first plane (P1) in the reference image with a second plane (P2) in the input image based on a random sample consensus (RANSAC) algorithm. For example, the mapping calculator 402 may randomly sample 4 points in each of the images and calculate a homography H based on the 4 pairs of points. The mapping calculator 402 may further determine a set of inliers that satisfy the homography H and iteratively refine the homography H based on the inliers.
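
OpenCV's findHomography implements this kind of sampled-consensus estimate and could serve as one realization of the mapping calculator 402; the sketch below is illustrative, not the disclosed implementation:

```python
# A sketch of RANSAC homography estimation using OpenCV (assumed available).
# src_pts and dst_pts are corresponding points on planes P1 and P2, e.g.
# obtained from feature matching.
import numpy as np
import cv2

def estimate_homography(src_pts, dst_pts, reproj_thresh=3.0):
    src = np.asarray(src_pts, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(dst_pts, dtype=np.float32).reshape(-1, 1, 2)
    # RANSAC repeatedly samples 4 correspondences, fits an H (8 degrees of
    # freedom), and keeps the model with the largest set of inliers.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, reproj_thresh)
    return H, inlier_mask
```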

FIG. 5 shows an example mapping of reference image data 510 to input image data 520 based on a homography 505. In some implementations, the homography 505 may be calculated by the mapping calculator 402 of FIG. 4. As such, the reference image data 510, the input image data 520, and the homography 505 may be examples of the reference image data 412, the input image data 422, and the mapping 405, respectively, of FIG. 4.

As shown in FIG. 5, the reference image data 510 depicts an object of interest 501 without any annotations. In some aspects, the object of interest 501 may be any device configured to display a sequence of digital or analog digits (such as the object of interest 301 of FIG. 3A). The digits can be displayed within a display region (depicted as a black rectangle) at the center of the object of interest 501. The input image data 520 also depicts the object of interest 501 without any annotations. However, the object of interest 501 is depicted at a different angle by the input image data 520 than by the reference image data 510. Moreover, the input image data 520 depicts the object of interest 501 with a sequence of numbers (“123456”) displayed thereon. As shown in FIG. 5, the display region of the object of interest 501 lies on a first plane P1 in a coordinate space associated with the reference image data 510 and also lies on a second plane P2 in a coordinate space associated with the input image data 520.

In some implementations, the homography 505 may relate each image point in the first plane P1 to a respective image point in the second plane P2. As shown in FIG. 5, the corners of the display region of the object of interest 501 map to image points 511-514 in the first plane P1 and map to image points 521-524 in the second plane P2. Thus, the homography 505 may transform the image points 511-514 in the first plane P1 to the image points 521-524, respectively, in the second plane P2. Aspects of the present disclosure recognize that the robustness of a homography depends on the number of image points available in each of the planes P1 and P2. Thus, in some implementations, a more robust homography can be calculated by extending the planar surface of the object of interest 501.

FIG. 6 shows another example mapping of reference image data 610 to input image data 620 based on a homography 605. In some implementations, the homography 605 may be calculated by the mapping calculator 402 of FIG. 4. As such, the reference image data 610, the input image data 620, and the homography 605 may be examples of the reference image data 412, the input image data 422, and the mapping 405, respectively, of FIG. 4.

As shown in FIG. 6, the reference image data 610 depicts an object of interest 601 without any annotations. In some aspects, the object of interest 601 may be any device configured to display a sequence of digital or analog digits (such as the object of interest 301 of FIG. 3A). The digits can be displayed within a display region (depicted as a black rectangle) at the center of the object of interest 601. The input image data 620 also depicts the object of interest 601 without any annotations. However, the object of interest 601 is depicted at a different angle by the input image data 620 than by the reference image data 610. Moreover, the input image data 620 depicts the object of interest 601 with a sequence of numbers (“123456”) displayed thereon. As shown in FIG. 6, the display region of the object of interest 601 lies on a first plane P1 in a coordinate space associated with the reference image data 610 and also lies on a second plane P2 in a coordinate space associated with the input image data 620.

In some implementations, the homography 605 may relate each image point in the first plane P1 to a respective image point in the second plane P2. However, in contrast with FIG. 5, a planar extension 602 is added or attached to the surface of the object of interest 601 in each of the images 610 and 620 of FIG. 6. For example, the planar extension 602 may be any object having a surface that is coplanar with the surface of the object of interest 601 (such as a sheet of paper, cardboard, or wood, among other examples). Accordingly, the homography 605 may transform a set of image points 611-614 that lies beyond the object of interest 601 in the first plane P1 to respective image points 621-624 that lie beyond the object of interest 601 in the second plane P2. As a result, the homography 605 may be more robust than the homography 505 of FIG. 5.

Referring back to FIG. 4, the annotation mapper 404 may map or convert the reference annotation data 414 to input annotation data 424 based on the mapping 405. In some implementations, the annotation mapper 404 may apply the homography H (calculated by the mapping calculator 402) to the reference annotation data 414. As described with reference to FIGS. 5 and 6, the homography H transforms each image point in a first plane P1 associated with the reference image data 412 to a respective image point in a second plane P2 associated with the input image data 422, where the planes P1 and P2 coincide with a surface of the object of interest. Because the annotations are projected onto the surface of the object of interest, the homography H transforms the reference annotation data 414 from the coordinate space of the reference image data 412 to the coordinate space of the input image data 422. As a result, the input image data 422 and the input annotation data 424 may collectively represent an annotated input image 420.
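
As an illustrative sketch of the annotation mapper 404, the homography H can be applied to the corner points of each annotation, for example with cv2.perspectiveTransform (assuming OpenCV; the disclosure does not prescribe a library):

```python
# Apply the homography H to the corner points of each annotation in the
# reference image to obtain annotations in the input image's coordinate space.
import numpy as np
import cv2

def map_annotations(H, reference_polygons):
    """reference_polygons: list of (N, 2) arrays of annotation corner points
    (bounding boxes, image mask) in reference-image coordinates."""
    mapped = []
    for poly in reference_polygons:
        pts = np.asarray(poly, dtype=np.float32).reshape(-1, 1, 2)
        projected = cv2.perspectiveTransform(pts, H)  # points on plane P2
        mapped.append(projected.reshape(-1, 2))
    return mapped
```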

FIG. 7 shows an example mapping 700 of reference annotation data 710 to input annotation data 720 based on a homography (H). In some implementations, the mapping 700 may be performed by the annotation mapper 404 of FIG. 4. As such, the reference annotation data 710, the input annotation data 720, and the homography H may be examples of the reference annotation data 414, the input annotation data 424, and the mapping 405, respectively, of FIG. 4. In some implementations, the homography H may be one example of any of the homographies 505 or 605 of FIGS. 5 and 6, respectively.

As shown in FIG. 7, the reference annotation data 710 represents a set of annotations 702 associated with an annotated reference image (such as the annotated reference image 310 of FIG. 3A). In the example of FIG. 7, the annotations 702 are shown to include a number of bounding boxes (depicted as six rectangular boxes each coinciding with a respective digit that can be displayed in a display region of an object of interest) and an image mask (depicted as a rectangular box that circumscribes the display region of the object of interest). With reference to FIG. 3A, for example, the annotations 702 may be one example of the annotations 302. As shown in FIG. 7, the annotations 702 lie on a first plane P1. As described with reference to FIGS. 5 and 6, the first plane P1 is defined by a coordinate space associated with the reference image data 510 or 610, respectively. More specifically, the first plane P1 coincides with a surface of the object of interest 501 or 601 depicted by the reference image data 510 or 610, respectively.

In some implementations, the homography H may transform the annotations 702 from the coordinate space associated with the reference image data 510 or 610 to a coordinate space associated with the input image data 520 or 620 of FIGS. 5 and 6, respectively. More specifically, the homography H transfers the annotations 702 from the first plane P1 to a second plane P2. As described with reference to FIGS. 5 and 6, the second plane P2 coincides with the surface of the object of interest as depicted by the input image data 520 and 620, respectively. Thus, the input annotation data 720 may be combined with the input image data 520 or 620 to produce an annotated input image (such as the annotated input image 330 of FIG. 3B). For example, the annotations 702 may be projected or overlaid upon the input image data 520 or 620 in the annotated input image.

FIG. 8 shows an example machine learning system 800, according to some implementations. In some implementations, the machine learning system 800 may be one example of the machine learning system 200 of FIG. 2. Thus, the machine learning system 800 may be configured to produce an ML model 808 based, at least in part, on a number of input images 802 and an annotated reference image 804. In some implementations, the input images 802 and the annotated reference image 804 may be examples of the input images 202 and the annotated reference image 201, respectively, of FIG. 2. In some implementations, the machine learning system 800 may include a processing system 810 and a memory 820.

The memory 820 may include an image data store 822 configured to store the input images 802, the annotated reference image 804, and any intermediate images produced by the machine learning system 800 (such as annotated input images). In some implementations, the annotated reference image 804 may depict an object of interest with one or more annotations. By contrast, the input images 802 may depict the same (or similar) object of interest at various distances, angles, or locations, or under various lighting conditions, but without the annotations included in the annotated reference image 804.

The memory 820 also may include a non-transitory computer-readable medium (including one or more nonvolatile memory elements, such as EPROM, EEPROM, Flash memory, a hard drive, and the like) that may store at least the following software (SW) modules:

-   an annotation mapping SW module 824 to map a plurality of first points in the annotated reference image 804 to a respective plurality of second points in an input image 802 so that the one or more annotations in the reference image 804 are projected onto the object of interest in the input image 802; and
-   a model training SW module 826 to train the ML model 808 to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.

Each software module includes instructions that, when executed by the processing system 810, cause the machine learning system 800 to perform the corresponding functions.

The processing system 810 may include any suitable one or more processors capable of executing scripts or instructions of one or more software programs stored in the machine learning system 800 (such as in the memory 820). For example, the processing system 810 may execute the annotation mapping SW module 824 to map a plurality of first points in the annotated reference image 804 to a respective plurality of second points in an input image 802 so that the one or more annotations in the reference image 804 are projected onto the object of interest in the input image 802. The processing system 810 may further execute the model training SW module 826 to train the ML model 808 to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.

FIG. 9 shows an illustrative flowchart 900 depicting an example operation for training a machine learning model, according to some implementations. In some implementations, the example operation 900 may be performed by a machine learning system (such as the machine learning system 200 of FIG. 2) to train a neural network to infer a value of each digit displayed by an object of interest (such as a meter).

The machine learning system receives a first input image depicting an object of interest (910). In some implementations, the object of interest may be configured to display a sequence of digits. The machine learning system also receives a reference image depicting the object of interest and one or more annotations associated with the object of interest (920). In some implementations, the object of interest may be depicted at a different distance, angle, or location in the reference image than in the first input image. In some implementations, the one or more annotations may include an image mask that delineates the sequence of digits from the remainder of the object of interest. In some implementations, the one or more annotations may include one or more bounding boxes each coinciding with a respective digit in the sequence of digits.

The machine learning system maps a plurality of first points in the reference image to a respective plurality of second points in the first input image so that the one or more annotations in the reference image are projected onto the object of interest in the first input image (930). In some implementations, the mapping of the plurality of first points to the plurality of second points may include calculating a homography that transforms the plurality of first points to the plurality of second points. The machine learning system further trains the machine learning model to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points (940). In some implementations, the training of the machine learning model may include annotating the first input image based on the homography and the one or more annotations in the reference image.
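
Tying the steps of operation 900 together, the sketches above could be combined roughly as follows; every helper named here (find_correspondences, estimate_homography, map_annotations, train, and loader_factory) is an illustrative assumption carried over from the earlier sketches, not the disclosed implementation:

```python
# An end-to-end sketch of operation 900, reusing the hypothetical helpers
# defined in the earlier code sketches.
def annotate_and_train(model, reference_gray, reference_polygons,
                       input_images, loader_factory):
    annotated = []
    for input_gray in input_images:
        # Steps 910/920: an input image and the annotated reference image.
        src, dst = find_correspondences(reference_gray, input_gray)
        # Step 930: map reference points to input-image points via H.
        H, _ = estimate_homography(src, dst)
        annotated.append((input_gray, map_annotations(H, reference_polygons)))
    # Step 940: train on the now-annotated input images. loader_factory is a
    # hypothetical helper that crops digit regions and pairs them with labels.
    return train(model, loader_factory(annotated))
```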

In some implementations, the machine learning system may further receive contextual information indicating a value of each digit in the sequence of digits displayed by the object of interest in the first input image. In such implementations, the training of the machine learning model may be further based on the received contextual information. In some implementations, the inferences include a numerical value associated with the sequence of digits displayed by the object of interest in each of the images.

In some implementations, the machine learning system may further receive a second input image depicting the object of interest and map a plurality of third points in the reference image to a respective plurality of fourth points in the second input image so that the one or more annotations in the reference image are projected onto the object of interest in the second input image. In such implementations, the training of the machine learning model may be further based on the mapping of the plurality of third points to the plurality of fourth points.

In some implementations, the object of interest may be depicted at a different distance, angle, or location in the first input image than in the second input image. In some implementations, the object of interest may be depicted under different lighting conditions in the first input image than in the second input image. In some implementations, the sequence of digits displayed by the object of interest in the first input image may have a different numerical value than the sequence of digits displayed by the object of interest in the second input image.

Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.

The methods, sequences, or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.

In the foregoing specification, embodiments have been described with reference to specific examples thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:
1. A method of training a machine learning model, comprising: receiving a first input image depicting an object of interest; receiving a reference image depicting the object of interest and one or more annotations associated with the object of interest; mapping a plurality of first points in the reference image to a respective plurality of second points in the first input image so that the one or more annotations in the reference image are projected onto the object of interest in the first input image; and training the machine learning model to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.
2. The method of claim 1, wherein the mapping of the plurality of first points to the plurality of second points comprises: calculating a homography that transforms the plurality of first points to the plurality of second points.
3. The method of claim 2, wherein the training of the machine learning model comprises: annotating the first input image based on the homography and the one or more annotations in the reference image.
4. The method of claim 1, wherein the object of interest is depicted at a different distance, angle, or location in the reference image than in the first input image.
5. The method of claim 1, wherein the object of interest is configured to display a sequence of digits.
6. The method of claim 5, wherein the one or more annotations include an image mask that delineates the sequence of digits from the remainder of the object of interest.
7. The method of claim 5, wherein the one or more annotations include one or more bounding boxes each coinciding with a respective digit in the sequence of digits.
8. The method of claim 5, further comprising: receiving contextual information indicating a value of each digit in the sequence of digits displayed by the object of interest in the first input image, the training of the machine learning model being further based on the received contextual information.
9. The method of claim 8, wherein the inferences include a numerical value associated with the sequence of digits displayed by the object of interest in each of the images.
10. The method of claim 5, further comprising: receiving a second input image depicting the object of interest; and mapping a plurality of third points in the reference image to a respective plurality of fourth points in the second input image so that the one or more annotations in the reference image are projected onto the object of interest in the second input image, the training of the machine learning model being further based on the mapping of the plurality of third points to the plurality of fourth points.
11. The method of claim 10, wherein the object of interest is depicted at a different distance, angle, or location in the first input image than in the second input image.
12. The method of claim 10, wherein the object of interest is depicted under different lighting conditions in the first input image than in the second input image.
13. The method of claim 10, wherein the sequence of digits displayed by the object of interest in the first input image has a different numerical value than the sequence of digits displayed by the object of interest in the second input image.
14. A machine learning system comprising: a processing system; and a memory storing instructions that, when executed by the processing system, cause the machine learning system to: receive a first input image depicting an object of interest; receive a reference image depicting the object of interest and one or more annotations associated with the object of interest; map a plurality of first points in the reference image to a respective plurality of second points in the first input image so that the one or more annotations in the reference image are projected onto the object of interest in the first input image; and train the machine learning model to produce inferences from images depicting the object of interest based at least in part on the mapping of the plurality of first points to the plurality of second points.
15. The machine learning system of claim 14, wherein the mapping of the plurality of first points to the plurality of second points comprises: calculating a homography that transforms the plurality of first points to the plurality of second points.
16. The machine learning system of claim 15, wherein the training of the machine learning model comprises: annotating the first input image based on the homography and the one or more annotations in the reference image.
17. The machine learning system of claim 14, wherein the object of interest is configured to display a sequence of digits and the one or more annotations include an image mask that delineates the sequence of digits from the remainder of the object of interest or include one or more bounding boxes each coinciding with a respective digit in the sequence of digits.
18. The machine learning system of claim 17, wherein execution of the instructions further causes the machine learning system to: receive contextual information indicating a value of each digit in the sequence of digits displayed by the object of interest in the first input image, the training of the machine learning model being further based on the received contextual information.
19. The machine learning system of claim 18, wherein the inferences include a numerical value associated with the sequence of digits displayed by the object of interest in each of the images.
20. The machine learning system of claim 17, wherein execution of the instructions further causes the machine learning system to: receive a second input image depicting the object of interest, the sequence of digits displayed by the object of interest in the first input image having a different numerical value than the sequence of digits displayed by the object of interest in the second input image; and map a plurality of third points in the reference image to a respective plurality of fourth points in the second input image so that the one or more annotations in the reference image are projected onto the object of interest in the second input image, the training of the machine learning model being further based on the mapping of the plurality of third points to the plurality of fourth points.